Variable Importance Using Decision Trees
Abstract
Decision trees and random forests are well-established models that not only offer good predictive performance but also provide rich feature importance information. Practitioners often employ variable importance methods that rely on this impurity-based information, yet these methods remain poorly characterized from a theoretical perspective. We provide novel insights into the performance of these methods by deriving finite-sample performance guarantees in a high-dimensional setting under various modeling assumptions. We further demonstrate the effectiveness of these impurity-based methods via an extensive set of simulations.
Similar resources
Using Fractional Factorial Designs for Variable Importance in Random Forest Models
Random Forests are a powerful classification technique, consisting of a collection of decision trees. One useful feature of Random Forests is the ability to determine the importance of each variable in predicting the outcome. This is done by permuting each variable and computing the change in prediction accuracy before and after the permutation. This variable importance calculation is similar t...
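The permutation procedure described in this abstract can be sketched in a few lines. This is a minimal illustration, not the cited paper's method: the toy data, the `toy_model` stand-in for a fitted forest, and the helper names `accuracy` and `permutation_importance` are all assumptions made for the example.

```python
import random

def accuracy(predict, X, y):
    """Fraction of rows for which predict(row) equals the label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean drop in accuracy after shuffling each column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle column j only; all other columns stay in place.
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

# Toy data (hypothetical): the label depends only on feature 0.
data_rng = random.Random(42)
X = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y = [int(row[0] > 0.5) for row in X]

def toy_model(row):
    # Stand-in for a fitted forest: thresholds feature 0, ignores feature 1.
    return int(row[0] > 0.5)

importances = permutation_importance(toy_model, X, y)
print(importances)
```

Because `toy_model` never reads feature 1, shuffling that column leaves every prediction unchanged and its importance is exactly zero, while shuffling feature 0 produces a clear drop in accuracy.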
A bias correction algorithm for the Gini variable importance measure in classification trees
This paper considers a measure of variable importance frequently used in variable selection methods based on decision trees and tree-based ensemble models, like CART, Random Forests and Gradient Boosting Machine. It is defined as the total heterogeneity reduction produced by a given covariate on the response variable when the sample space is recursively partitioned. Some authors showed that thi...
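The "total heterogeneity reduction" definition can be made concrete with a small sketch using Gini impurity. The tree below is hand-built and hypothetical (variable names `x0` and `x1`, eight samples), and the helper names are assumptions for illustration; this is not the bias-corrected measure the paper proposes.

```python
from collections import Counter, defaultdict

def gini(labels):
    """Gini impurity 1 - sum_k p_k^2 of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def impurity_decrease(parent, left, right, n_total):
    """Gini reduction of one split, weighted by the parent's share of samples."""
    n = len(parent)
    children = (len(left) * gini(left) + len(right) * gini(right)) / n
    return (n / n_total) * (gini(parent) - children)

# Hypothetical tree on 8 samples: the root splits on "x0",
# and its left child splits on "x1".
root = [0, 0, 0, 1, 1, 1, 1, 1]
left, right = [0, 0, 0, 1], [1, 1, 1, 1]
left_left, left_right = [0, 0, 0], [1]
splits = [("x0", root, left, right), ("x1", left, left_left, left_right)]

# Importance of a variable = sum of weighted decreases over its splits.
importance = defaultdict(float)
for variable, parent, l, r in splits:
    importance[variable] += impurity_decrease(parent, l, r, n_total=len(root))

print(dict(importance))
```

In this toy tree every leaf is pure, so the per-variable decreases add up to the Gini impurity of the root node.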
Analysis of a bias effect in a tree-based variable importance measure
The research in the field of data mining has widely addressed the problem of variable selection and several variable importance measures have been proposed in the literature. This paper deals with a frequently used variable importance measure defined in the context of decision trees and tree-based ensemble models like Random Forests and Treeboost. The aim of this paper is to show the existence ...
A Random Forest Guided Tour
The random forest algorithm, proposed by L. Breiman in 2001, has been extremely successful as a general-purpose classification and regression method. The approach, which combines several randomized decision trees and aggregates their predictions by averaging, has shown excellent performance in settings where the number of variables is much larger than the number of observations. Moreover, it is...
Using Random Forests and Fuzzy Logic for Automated Storm Type Identification
This paper discusses how random forests, ensembles of weakly-correlated decision trees, can be used in concert with fuzzy logic concepts to both classify storm types based on a number of radar-derived storm characteristics and provide a measure of “confidence” in the resulting classifications. The random forest technique provides measures of variable importance and interactions, as well as meth...